Abstract

In many recent applications, data is plentiful. By now, we have a rather clear understanding of how more data can 
be used to improve the accuracy of learning algorithms. Recently, there has been a growing interest in understanding 
how more data can be leveraged to reduce the required training runtime. In this paper, we study the runtime of 
learning as a function of the number of available training examples, and underscore the main high-level techniques. 
We provide some initial positive results showing that the runtime can decrease exponentially while only requiring a
polynomial growth of the number of examples, and spell out several interesting open problems.

1 Introduction 

Machine learning is now prevalent in a large range of scientific, engineering and every-day tasks, ranging from
analysis of genomic data, through vehicle and aircraft control to locating information on the web and providing users
with personalized recommendations. Meanwhile, our world has become increasingly "digitized" and the amount of
data available for training is dramatically increasing. By now, we have a rather clear understanding of how more data
can be used to improve the accuracy of learning algorithms. In this paper we study how more data can be beneficial
for constructing more efficient learning algorithms.

Roughly speaking, one way to show how more data can reduce the training runtime is as follows. Consider learning
by finding a hypothesis in the hypothesis class that minimizes the training error. In many situations, this search
problem is computationally hard. One can circumvent the hardness by replacing the original hypothesis class with a
different (larger) hypothesis class, such that the search problem in the larger class is computationally easier (e.g.,
the search problem in the new hypothesis class reduces to a convex optimization problem). On the flip side, from the
statistical point of view, the estimation error in the new hypothesis class might be larger than the estimation error
in the original class, and thus, with a small number of examples, learning the larger class might lead to overfitting
even though the same number of examples suffices for the original hypothesis class. However, having more training
examples keeps the overfitting in check. In particular, if the number of extra examples we need for learning the new
class is only polynomially larger than the original number of examples, we end up with an efficient algorithm for the
original problem. If, however, we don't have those extra examples, our only option is to learn the original hypothesis
class, which may be computationally harder.

Figure 1: The Basic Approach

The goal of this paper is to present a formal model for studying the runtime of learning algorithms as a function of
the available number of examples. After defining the formal model, we present a binary classification learning problem
for which we can provably (based on a standard cryptographic assumption) demonstrate an inverse dependence of the
runtime on the number of examples. While there have been previous constructions which demonstrated a similar
phenomenon, assuming the existence of a "perfect" hypothesis, we show this in the much more natural agnostic model of
learning. A possible criticism is that our learning problem is still rather synthetic. We continue with presenting several
learning problems, which arise in natural settings, that have more efficient algorithms by relying on the availability of
more training data. Some of these examples are based on the intuition of Figure 1, but some are also based on other
ideas and techniques. However, for all these problems, the analysis is based on upper bounds without having matching
lower bounds. This raises several interesting open problems.

1.1 Related Work 

Decatur, Goldreich, and Ron [6] were the first to jointly study the computational and sample complexity, and to show
that a tradeoff between runtime and sample size exists. In particular, they distinguish between the information theoretic
sample complexity of a class and its computational sample complexity, the latter being the number of examples needed for
learning the class in polynomial time. They presented a learning problem which is not efficiently learnable from a small
training set, and is efficiently learnable from a polynomially larger training set. Servedio [11] showed that for a concept
class composed of 1-decision-lists over $\{0,1\}^n$, which can be learned inefficiently using $O(1)$ examples, no algorithm
can learn it efficiently using $o(n)$ examples, and there is an efficient algorithm using $\Omega(n)$ examples. The
construction was also extended to $k$-decision-lists, $k > 1$, with larger gaps.

In contrast to [6, 11], which focused on learning under the realizable case (namely, that the labels are generated 
by some hypothesis in the class), we mostly focus on the more natural agnostic setting, where any distribution over 
the example domain is possible, and there may be no hypothesis h in our class that never errs. This is not just a 
formality - in both [6, 11], the construction crucially relies on the fact that the labels are provided by some hypothesis 
in the class. In terms of techniques, we rely on the cryptographic assumption that one-way permutations exist, which 
is the same assumption as in [11] and similar to the assumption in [6]. We note that cryptographic assumptions are 
common in proving lower bounds for efficient learnability, and in some sense they are even necessary [2]. However, 
our construction is very different. For example, in both [6, 11], revealing information on the identity of the "correct" 
hypothesis is split among many different examples. Therefore, efficient learning is possible after sufficiently many 
examples are collected, which then allows us to return the "correct" hypothesis. In our agnostic setting, there is no 
"correct" hypothesis, so this kind of approach cannot work. Instead, our efficient learning procedure computes and 
returns an improper predictor, which is not in the hypothesis class at all. 

A potential weakness of our example, as well as the example given in [6], is that our hypothesis class does not
consist of "natural" hypotheses. The class employed in [11] is more natural, but it is also a very carefully constructed
subset of decision lists. The goal of the second part of the paper is to demonstrate gaps (though based on upper bounds)
for natural learning problems.

Another contribution of our model is that it captures the exact tradeoff between sample and computational complexity
rather than only distinguishing between polynomial and non-polynomial time, which may not be refined enough.
Bottou and Bousquet [5] initiated a study on learning in the data laden domain - a scenario in which data is plentiful
and computation time is the main bottleneck. This is the case in many real life applications nowadays. Shalev-Shwartz
and Srebro [13] continued this line of research and showed how for the problem of training Support Vector Machines, a
joint statistical-computational analysis reveals how the runtime of stochastic-gradient-descent can potentially decrease
with the number of training examples. However, this is only demonstrated via upper bounds. More importantly, the
advantage of having more examples only improves running time by constant factors. In this paper, we will be interested
in larger factors of improvement, which scale with the problem size.

2 Formal Model Description 

We consider the standard model of supervised statistical learning, in which each training example is an instance-target
pair and the goal of the learner is to use past examples in order to predict the targets associated with future instances.
For example, in spam classification problems, an instance is an email message and the target is either $+1$ ('spam')
or $-1$ ('benign'). We denote the instance domain by $\mathcal{X}$ and the target domain by $\mathcal{Y}$. A prediction
rule is a mapping $h : \mathcal{X} \to \mathcal{Y}$. The performance of a predictor $h$ on an instance-target pair,
$(x, y) \in \mathcal{X} \times \mathcal{Y}$, is measured by a loss function $\ell(h(x), y)$. For example, a natural loss
function for classification problems is the 0-1 loss, $\ell(h(x), y) = 1$ if $y \neq h(x)$ and $0$ otherwise.
A learning algorithm, $A$, receives a training set of $m$ examples, $S_m = ((x_1, y_1), \ldots, (x_m, y_m))$, which are
assumed to be sampled i.i.d. from an unknown distribution $\mathcal{D}$ over the problem domain
$Z \subseteq \mathcal{X} \times \mathcal{Y}$. Using the training data, together with any prior knowledge or assumptions
about the distribution $\mathcal{D}$, the learner forms a prediction rule. The predictor is a random variable and we
denote it by $A(S_m)$. The goal of the learner is to find a prediction rule with low generalization error (a.k.a. risk),
defined as the expected loss:

\[ \mathrm{err}(h) \stackrel{\mathrm{def}}{=} \mathbb{E}_{(x,y) \sim \mathcal{D}} [\ell(h(x), y)] . \]

The well known no-free-lunch theorem tells us that no algorithm can minimize the risk without making some prior
assumptions on $\mathcal{D}$. Following the agnostic PAC framework, we require that the learner will find a predictor
whose risk will be close to $\inf_{h \in \mathcal{H}} \mathrm{err}(h)$, where $\mathcal{H}$ is called a hypothesis class
(which is known to the learner).

We use $\mathrm{err}(A(S_m))$ to denote the expected risk of the predictor returned by $A$, where expectation is with
respect to the random choice of the training set. We denote by $\mathrm{time}(A, m)$ the upper bound on the expected
runtime¹ of the algorithm $A$ when running on any training set of $m$ examples. The main mathematical object that we
propose to study is the following:

\[ T_{\mathcal{H},\epsilon}(m) = \min\left\{ t : \exists A \text{ s.t. } \forall \mathcal{D},\ \mathrm{time}(A, m) \le t \ \wedge\ \mathrm{err}(A(S_m)) \le \inf_{h \in \mathcal{H}} \mathrm{err}(h) + \epsilon \right\} , \qquad (1) \]

where when no $t$ satisfies the above constraint we set $T_{\mathcal{H},\epsilon}(m) = \infty$. Thus,
$T_{\mathcal{H},\epsilon}(m)$ measures the required runtime to learn the class $\mathcal{H}$ with an excess error of
$\epsilon$ given a budget of $m$ training examples. Studying this function can show us how more data can be used to
decrease the required runtime of the learning algorithm. The minimum value of $m$ for which
$T_{\mathcal{H},\epsilon}(m) < \infty$ is the information-theoretic sample complexity. This corresponds to the case in
which we ignore computation time. The other extreme case is the value of $T_{\mathcal{H},\epsilon}(\infty)$. This
corresponds to the data laden domain, namely data is plentiful and computation time is the only bottleneck.

We continue with a few additional definitions. In general, we make no assumptions on the distribution $\mathcal{D}$.
However, we sometimes refer to the realizable case, in which we assume that the distribution $\mathcal{D}$ satisfies
$\min_{h \in \mathcal{H}} \mathrm{err}(h) = 0$. The empirical error on the training examples, called the training error,
is denoted by $\mathrm{err}_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i)$.
A common learning paradigm is Empirical Risk Minimization, denoted $\mathrm{ERM}_{\mathcal{H}}$, in which the learner can
output any predictor in $\mathcal{H}$ that minimizes $\mathrm{err}_S(h)$. A learning algorithm is called proper if it
always returns a hypothesis from $\mathcal{H}$. Throughout this paper we are concerned with improper learning, where the
returned hypothesis can be any efficiently computed function $h$ from instances $x$ to labels $y$. Note that improper
learning is just as useful as proper learning for the purpose of deriving accurate predictors.
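As a small illustration of the definitions of the training error and the ERM rule, the following Python sketch computes $\mathrm{err}_S(h)$ under the 0-1 loss and picks a minimizer from a finite class by exhaustive search. The threshold class, the toy sample, and all function names are hypothetical and chosen only for illustration; exhaustive search is of course exactly the step that becomes infeasible for the large classes discussed in this paper.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[Tuple[float, int]]
Hypothesis = Callable[[float], int]


def zero_one_loss(prediction: int, target: int) -> int:
    """0-1 loss: 1 on a mistake, 0 otherwise."""
    return int(prediction != target)


def empirical_error(h: Hypothesis, sample: Sample) -> float:
    """err_S(h): average loss of h over the training sample."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)


def erm(hypotheses: Sequence[Hypothesis], sample: Sample) -> Hypothesis:
    """Return some hypothesis in the (finite) class with minimal training error."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample))


def threshold(t: float) -> Hypothesis:
    """Hypothetical toy hypothesis: predict 1 iff x >= t."""
    return lambda x: int(x >= t)


toy_class: List[Hypothesis] = [threshold(t) for t in (0.5, 1.5, 2.5, 3.5)]
toy_sample = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
best = erm(toy_class, toy_sample)
```

A proper learner must return an element of `toy_class`; an improper learner may return any efficiently computable function, which is the freedom exploited in the rest of the paper.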

2.1 A Warm-up Example 

To illustrate how more data can reduce runtime, consider the problem of learning the class of 3-term disjunctive normal
form (DNF) formulas in the realizable case. A 3-DNF is a Boolean mapping, $h : \{0,1\}^d \to \{0,1\}$, that can be written
as $h(x) = T_1(x) \vee T_2(x) \vee T_3(x)$, where for each $i$, $T_i(x)$ is a conjunction of an arbitrary number of
literals, e.g. $T_i(x) = x_1 \wedge \neg x_3 \wedge x_5 \wedge \neg x_7$.

Since the number of 3-DNF formulas is at most $3^{3d}$, it follows that the information theoretic sample complexity
is $O(d/\epsilon)$. However, it was shown [10, 9] that unless RP=NP, the search problem of finding a 3-DNF formula which
is (approximately) consistent with a given training set cannot be performed in $\mathrm{poly}(d)$ time. On the other
hand, we will show below that if $m = \Theta(d^3/\epsilon)$ then $T_{\mathcal{H},\epsilon}(m) = \mathrm{poly}(d/\epsilon)$.
Note that there is no contradiction between the last two sentences, since the former establishes hardness of proper
learning while the latter claims feasibility of improper learning.

¹ To prevent trivialities, we also require that the runtime of applying $A(S_m)$ on any instance is at most $\mathrm{time}(A, m)$.

To show the positive result, observe that each 3-DNF formula can be rewritten as
$\bigwedge_{u \in T_1, v \in T_2, w \in T_3} (u \vee v \vee w)$ for three sets of literals $T_1, T_2, T_3$. Define
$\psi : \{0,1\}^d \to \{0,1\}^{2(2d)^3}$ such that for each triplet of literals $u, v, w$, there are two indices in
$\psi(x)$, indicating if $u \vee v \vee w$ is true or false. Therefore, each 3-DNF can be represented as a single
conjunction over $\psi(x)$. As a result, the class of 3-DNFs over $x$ is a subset of the class of conjunctions over
$\psi(x)$. The search problem of finding an ERM over the class of conjunctions is polynomially solvable (it can be cast
as a linear program, or can be solved using a simple greedy algorithm). However, the information theoretic sample
complexity of learning conjunctions over $2(2d)^3$ variables is $O(d^3/\epsilon)$. We conclude that if
$m = \Theta(d^3/\epsilon)$ then $T_{\mathcal{H},\epsilon}(m) = \mathrm{poly}(d/\epsilon)$.
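The reduction from 3-DNFs to conjunctions can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the greedy rule shown is the textbook one for the realizable case (keep exactly those features that equal 1 on every positive example), and no attempt is made to be efficient beyond polynomial time.

```python
from itertools import product


def literals(x):
    """Values of the 2d literals x_1, ..., x_d, (not x_1), ..., (not x_d)."""
    return list(x) + [1 - bit for bit in x]


def psi(x):
    """Map x in {0,1}^d to {0,1}^(2 * (2d)^3): two indicators per triplet of literals."""
    features = []
    for u, v, w in product(literals(x), repeat=3):
        clause = u | v | w
        features.append(clause)        # indicator that (u or v or w) is true
        features.append(1 - clause)    # indicator that (u or v or w) is false
    return features


def learn_conjunction(sample):
    """Greedy learner: keep the features that are 1 on every positive example."""
    kept = set(range(len(psi(sample[0][0]))))
    for x, y in sample:
        if y == 1:
            fx = psi(x)
            kept = {j for j in kept if fx[j] == 1}
    return kept


def predict(kept, x):
    """Evaluate the learned conjunction over psi(x)."""
    fx = psi(x)
    return int(all(fx[j] == 1 for j in kept))
```

When the labels are generated by a 3-DNF, every clause feature of its conjunctive form survives the greedy pruning, so the returned conjunction is consistent with the training set even though it is typically not itself a 3-DNF.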

It is important to emphasize that the analysis above is not satisfactory for two reasons. First, we do not know
whether it is impossible to improperly learn 3-DNFs in polynomial time using $O(d/\epsilon)$ examples. All we know is
that the ERM approach is not efficient. Second, we do not know if the information theoretic sample complexity of learning
conjunctions over $\psi(x)$ is $\Omega(d^3/\epsilon)$. Maybe the specific structure of the range of $\psi$ yields a lower
sample complexity.

But, if we do believe that the above analysis indeed reflects reality, we obtain two points on the curve
$T_{\mathcal{H},\epsilon}(m)$. Still, we do not know what the rest of the curve looks like. This is illustrated below.
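[Figure: the curve $T_{\mathcal{H},\epsilon}(m)$ as a function of $m$, with the two known points labeled "3-DNF" and "Conjunction".]

| | Samples | Time |
|---|---|---|
| ERM over 3-DNF | $d/\epsilon$ | not $\mathrm{poly}(d)$ |
| ERM over Conjunctions | $d^3/\epsilon$ | $\mathrm{poly}(d/\epsilon)$ |
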

3 Formal derivation of gaps 

In this section, we formally show a learning problem which exhibits an inverse dependence of the runtime on the
number of examples. As discussed in Subsection 1.1, it is distinguished from previous work in being applicable to
the natural agnostic setting, where we do not assume that a perfect hypothesis exists. Since this assumption was crucial
in all previous works, the construction we use is rather different.

To present the result, we will need the concept of a one-way permutation. Intuitively, a one-way permutation over
$\{0,1\}^n$ is a permutation which is computationally hard to invert. More formally, let $U_n$ denote the uniform
distribution over $\{0,1\}^n$, and let $\{0,1\}^*$ denote the set of all finite bit strings. Then we have the following
definition:

Definition 1. A one-way permutation $P : \{0,1\}^* \to \{0,1\}^*$ is a function which for any $n$, maps $\{0,1\}^n$ to
itself; there exists an algorithm for computing $P(x)$, whose runtime is polynomial in the length of $x$; and for any
(possibly randomized) polynomial-time algorithm $A$ and any polynomial $p(n)$ over $n$,
$\Pr_{x \sim U_n}(A(P(x)) = x) < \frac{1}{p(n)}$ for sufficiently large $n$.

It is widely conjectured that such one-way permutations exist. One concrete candidate is the RSA permutation
function, which treats $x \in \{0,1\}^n$ as a number in $\{0, \ldots, 2^n - 1\}$, and returns $P(x) = x^3 \bmod N$, where
$N = pq$ is a product of two "random" primes $p, q$ of length $n$ such that 3 does not divide $(p - 1)(q - 1)$. However,
since the existence of such a one-way permutation would imply $P \neq NP$, there is no formal proof that such functions
exist (see [7] for this and related results).
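To make the RSA candidate concrete, here is a toy Python sketch of the cubing map. The primes 5 and 11 are an assumption made purely for illustration (they offer no security whatsoever), and the map is treated as acting on $\{0, \ldots, N-1\}$ rather than on bit strings. The sketch only shows that cubing is a permutation when 3 is coprime to $(p-1)(q-1)$, and that inverting it is easy for someone who knows the factorization.

```python
from math import gcd

# Toy parameters (assumption: chosen only so the example is easy to check by hand).
p, q = 5, 11
N = p * q
phi = (p - 1) * (q - 1)
assert gcd(3, phi) == 1  # 3 must not divide (p - 1)(q - 1)


def P(x: int) -> int:
    """Candidate one-way permutation: cube modulo N."""
    return pow(x, 3, N)


# Trapdoor: the inverse exponent is computable only from the factorization of N.
# pow with a negative exponent and a modulus requires Python 3.8 or later.
d = pow(3, -1, phi)


def P_inverse(y: int) -> int:
    """Invert P using the trapdoor exponent d."""
    return pow(y, d, N)
```

Without the factorization, the only generic way to invert `P` on such a toy instance is exhaustive search over all inputs, which mirrors the $O(2^n)$ cost of the inefficient learner discussed below.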

Theorem 1. There exists an agnostic binary classification learning problem over $\mathcal{X} = \{0,1\}^{2n}$ and
$\mathcal{Y} = \{0,1\}$ with the following properties:

• It is inefficiently learnable with sample size $m = O(1/\epsilon)$, and running time $O(2^n + m)$.

• Assuming one-way permutations exist, there exists no polynomial-time learning algorithm based on a sample of size
$O(\log(n))$.

• It is efficiently learnable with a sample of size $m = O(n/\epsilon^2)$. Specifically, the training time is $O(m)$,
resulting in an improper predictor whose runtime is $O(m^3)$.

The theorem implies that in the reasonable regime where $1/\epsilon < \log(n) < n/\epsilon^2$, we really get an inverse
dependence of the runtime on the training size. The theorem is illustrated below:

[Figure: $T_{\mathcal{H},\epsilon}(m)$ as a function of $m$ across the three sample-size regimes of Theorem 1.]

To prove the theorem, we will define the following learning problem. Let $\mathcal{X} = \{0,1\}^{2n}$ and
$\mathcal{Y} = \{0,1\}$. We will treat each $x \in \mathcal{X}$ as a pair $(r, s)$, where $r$ refers to the first $n$ bits
in $x$, and $s$ to the last $n$ bits. Let $\langle r, r' \rangle = \sum_{i=1}^{n} r_i r'_i \bmod 2$ denote the inner
product over the field GF(2). Let $P$ be a one-way permutation. Then the example domain is the following subset of
$\mathcal{X} \times \mathcal{Y}$:

\[ Z = \{ ((r, s), b) : r, s \in \{0,1\}^n,\ \langle P^{-1}(s), r \rangle = b \} . \]

The loss function we use is simply the 0-1 loss, $\ell(h(x), y) = \mathbf{1}_{h(x) \neq y}$.

The hypothesis class $\mathcal{H}$ consists of randomized functions, parameterized by $\{0,1\}^n$, and defined as follows,
where $U_1$ is the uniform distribution on $\{0,1\}$:

\[ \mathcal{H} = \left\{ h_x(r, s) = \begin{cases} \langle x, r \rangle & s = P(x) \\ b \sim U_1 & \text{otherwise} \end{cases} \ :\ x \in \{0,1\}^n \right\} . \]

Learning with $O(\log(n))$ Samples is Hard

We consider the following "hard" set of distributions $\{\mathcal{D}_x\}$, parameterized by $x \in \{0,1\}^n$: each
$\mathcal{D}_x$ is a uniform distribution over all $((r, P(x)), \langle x, r \rangle)$. Note that there are exactly $2^n$
such examples, one for each choice of $r \in \{0,1\}^n$. Also, note that for any such distribution $\mathcal{D}_x$,
$\inf_{h \in \mathcal{H}} \mathrm{err}(h) = 0$, and this is achieved with the hypothesis $h_x$.

First, we will prove that with a sample size $m = O(\log(n))$, any efficient learner fails on at least one of the
distributions $\mathcal{D}_x$. To see this, suppose on the contrary that we have an efficient distribution-free learner
$A$, that works on all $\mathcal{D}_x$, in the sense of seeing $m = O(\log(n))$ examples and then outputting some
hypothesis $h$ such that $h((r, P(x))) = \langle x, r \rangle$ with even some non-trivial probability (e.g. at least
$1/2 + 1/\mathrm{poly}(n)$). We will soon show how to use such a learner $A$ to obtain, with probability at least
$1/\mathrm{poly}(n)$, an efficient algorithm $A'$ which, given just $P(x)$ and $r$, outputs $\langle x, r \rangle$ with
probability at least $1/2 + 1/\mathrm{poly}(n)$. However, by the Goldreich-Levin Theorem ([7], Theorem 2.5.2), such an
algorithm can be used to efficiently invert $P$, violating the assumption that $P$ is a one-way permutation.

Thus, we just need to show how given $P(x)$, $r$, we can efficiently compute $\langle x, r \rangle$ with probability at
least $1/\mathrm{poly}(n)$. The procedure works as follows: we pick $m = O(\log(n))$ vectors $r_1, \ldots, r_m$ uniformly
at random from $\{0,1\}^n$, and pick uniformly at random bits $b_1, \ldots, b_m$. We then apply our learning algorithm $A$
over the examples $\{((r_i, P(x)), b_i)\}_{i=1}^{m}$, getting us some predictor $h'$. We then attempt to predict
$\langle x, r \rangle$ by computing $h'((r, P(x)))$.

To see why this procedure works, we note that with probability $1/2^m = 1/\mathrm{poly}(n)$, we picked values for
$b_1, \ldots, b_m$ such that $b_i = \langle x, r_i \rangle$ for all $i$. If this event happened, then the training set we
get is distributed like $m$ i.i.d. examples from $\mathcal{D}_x$. By our assumption on $A$, and the fact that
$\inf_h \mathrm{err}(h) = 0$, it follows that with probability at least $1/\mathrm{poly}(n)$, $A$ will return a hypothesis
which predicts correctly with probability at least $1/2 + 1/\mathrm{poly}(n)$, as required.
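The reason a logarithmic sample size is exactly what makes the label-guessing step affordable can be spelled out with a one-line calculation. The constant $c$ below is an assumption introduced only to make the $O(\log(n))$ bound explicit:

```latex
m = c \log_2 n
\quad\Longrightarrow\quad
\Pr\bigl(b_i = \langle x, r_i \rangle \ \text{for all } i\bigr)
  = 2^{-m} = 2^{-c \log_2 n} = n^{-c} = \frac{1}{\mathrm{poly}(n)} .
```

For a sample size growing faster than logarithmically, the same guess would succeed only with super-polynomially small probability, so the reduction no longer contradicts one-wayness.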

Inefficient Distribution-Free Learning Possible with $O(1/\epsilon)$ Samples

Ignoring computational constraints, we can use the following simple learning algorithm: given a training sample
$\{((r_i, s_i), b_i)\}_{i=1}^{m}$, find the most common value $s'$ among $s_1, \ldots, s_m$, compute $x' = P^{-1}(s')$
(inefficiently, say by exhaustive search), and return the hypothesis $h_{x'}$.

To see why this works, we will need the following lemma, which shows that if $h_x$ has a low error rate, then
$s = P(x)$ is likely to appear frequently in the examples (the proof appears in Appendix A).

Lemma 1. For any distribution $\mathcal{D}$ over examples, and any fixed $x \in \{0,1\}^n$, it holds that
$\Pr_s(s = P(x)) = 1 - 2\,\mathrm{err}(h_x)$.

Suppose that $h_x$ is the hypothesis with the smallest generalization error in the hypothesis class. We now do a case
analysis: if $\mathrm{err}(h_x) > 1/2 - \epsilon$, then the predictor $h_{x'}$ returned by the algorithm is almost as
good. This is because the probability in the lemma statement cannot be negative, so for any $x'$ (and in particular the
one used by the algorithm), we have $\mathrm{err}(h_{x'}) \le 1/2$.

The other case we need to consider is that $\mathrm{err}(h_x) \le 1/2 - \epsilon$. By the lemma, $s = P(x)$ is the value
of $s$ most likely to occur in the sample (since $h_x$ is the one with smallest generalization error), and its probability
of being picked is at least $1 - 2(1/2 - \epsilon) = 2\epsilon$. This means that after $O(1/\epsilon)$ examples, with
overwhelming probability, the $s'$ we pick is such that $\Pr_s(s = P(x)) - \Pr_s(s = s') < \epsilon/2$. But again by the
lemma, this implies that $\mathrm{err}(h_{x'}) - \mathrm{err}(h_x)$ is at most $\epsilon/4$. So the hypothesis $h_{x'}$
that our algorithm returns is an $\epsilon/4$-optimal classifier as required.
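The last implication follows by applying Lemma 1 twice, once to $x$ and once to $x'$, and using $P(x') = s'$:

```latex
\mathrm{err}(h_{x'}) - \mathrm{err}(h_x)
  = \frac{1 - \Pr_s(s = s')}{2} - \frac{1 - \Pr_s(s = P(x))}{2}
  = \frac{\Pr_s(s = P(x)) - \Pr_s(s = s')}{2}
  < \frac{1}{2} \cdot \frac{\epsilon}{2} = \frac{\epsilon}{4} .
```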

Efficient Distribution-Free Learning Possible with $O(n/\epsilon^2)$ Samples

We will need the following lemma, whose proof appears in Appendix A:

Lemma 2. Let $\mathcal{D}$ be some distribution over $\{0,1\}^n$, and suppose we sample $m'$ vectors
$r_1, \ldots, r_{m'}$ from that distribution. Then the probability that a freshly drawn vector $r$ is not spanned by
$r_1, \ldots, r_{m'}$ is at most $n/m'$.

We use a similar algorithm to the one discussed earlier for inefficient learning. However, instead of finding the
most common $s'$, computing $x' = P^{-1}(s')$ and returning $h_{x'}$, which cannot be done efficiently, we build a
predictor which is at most $\epsilon$ worse than $h_{x'}$, and doesn't require us to find $x'$ explicitly.

To do so, let $\{((r_{i_j}, s_{i_j}), b_{i_j})\}_{j=1}^{m'}$ be the subset of examples for which $s_{i_j} = s'$. By
definition of $Z$, we know that for any such example,
$\langle x', r_{i_j} \rangle = \langle P^{-1}(s'), r_{i_j} \rangle = b_{i_j}$. In other words, this gives us a set of
values $r_{i_1}, \ldots, r_{i_{m'}}$ for which we know
$\langle x', r_{i_1} \rangle, \ldots, \langle x', r_{i_{m'}} \rangle$. As a consequence, for any $r$ in the linear
subspace spanned by $r_{i_1}, \ldots, r_{i_{m'}}$, we can efficiently compute $\langle x', r \rangle$. Let $B$ denote this
subspace. Then our improper predictor works as follows, given some instance $(r, s)$:

• If $s = s'$ and $r \in B$, output $\langle x', r \rangle$ (note that this is the same output as $h_{x'}$, by definition).

• If $s \neq s'$, output a random bit (note that this is the same output as $h_{x'}$, by definition of $h_{x'}$).

• If $s = s'$ and $r \notin B$, output a bit uniformly at random.

Note that checking whether $r \in B$ can always be done in at most $O(m'^3) \le O(m^3)$ time, via Gaussian elimination.
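The three cases above can be sketched in Python as follows. This is an illustrative sketch rather than the construction's reference implementation: bit vectors are encoded as Python integers, the class and method names are hypothetical, and the elimination is done incrementally (one vector at a time) instead of as a single Gaussian elimination pass, which computes the same span.

```python
import random
from collections import Counter


def inner(a: int, b: int) -> int:
    """Inner product over GF(2) of two bit vectors encoded as integers."""
    return bin(a & b).count("1") % 2


class SpanPredictor:
    """Improper predictor built from examples ((r, s), b) without ever inverting P."""

    def __init__(self, sample):
        # s' is the most common value of s in the training sample.
        self.s_prime = Counter(s for (_, s), _ in sample).most_common(1)[0][0]
        # Maps a leading-bit position to (vector, known inner product with x').
        self.basis = {}
        for (r, s), b in sample:
            if s == self.s_prime:
                rest, label = self._reduce(r, b)
                if rest:
                    self.basis[rest.bit_length() - 1] = (rest, label)

    def _reduce(self, r: int, b: int):
        """Eliminate r against the basis, XOR-ing the known labels along the way."""
        while r:
            lead = r.bit_length() - 1
            if lead not in self.basis:
                return r, b
            vec, label = self.basis[lead]
            r ^= vec
            b ^= label
        return 0, b

    def predict(self, r: int, s: int, rng=random) -> int:
        if s != self.s_prime:
            return rng.randrange(2)   # same behaviour as h_{x'}
        rest, value = self._reduce(r, 0)
        if rest == 0:
            return value              # r is in B, so this equals <x', r>
        return rng.randrange(2)       # r is not in B: guess uniformly
```

Every basis entry stores a vector that is an XOR of training vectors together with the XOR of their labels, so linearity of the inner product guarantees that the value returned in the first case equals $\langle x', r \rangle$.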

Now, we claim that the probability of the third case happening is at most $\epsilon/2$. If this is indeed true, then our
improper predictor is only $\epsilon/2$ worse (in terms of generalization error) than $h_{x'}$, which based on the
argument in the previous section, is already $\epsilon$-close to optimal.

So let us consider the possibility that $s = s'$ and $r \notin B$. If $\Pr_s(s = s') < \epsilon$, we are done, so let us
suppose that $\Pr_s(s = s') \ge \epsilon$. This means that $m'$ is unlikely to be much smaller than $\epsilon m$. More
precisely, by the multiplicative Chernoff bound, $\Pr(m' < \epsilon m/2) \le \exp(-\epsilon m/8)$. Also, conditioned on
some fixed $m' \ge \epsilon m/2$, Lemma 2 assures us that
$\Pr(r \notin B \mid s = s') \le n/m' \le 2n/(\epsilon m)$. Overall, we get the following (the probabilities are over the
draw of the training set and an additional example $((r, s), b)$):

\begin{align*}
\Pr(s = s', r \notin B) &= \Pr(s = s', r \notin B, m' < \epsilon m/2) + \Pr(s = s', r \notin B, m' \ge \epsilon m/2) \\
&\le \Pr(m' < \epsilon m/2) + \Pr(m' \ge \epsilon m/2, r \notin B \mid s = s') \\
&\le \exp(-\epsilon m/8) + \sum_{m' = \epsilon m/2}^{\infty} \Pr(m') \Pr(r \notin B \mid s = s', m') \le \exp(-\epsilon m/8) + \frac{2n}{\epsilon m} .
\end{align*}

By taking $m = O(n/\epsilon^2)$ examples, we can ensure this to be at most order $\epsilon$.
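As a worked instance of this last step, take the concrete choice $m = 4n/\epsilon^2$ (the constant 4 is an assumption made here only for illustration) and use the elementary inequality $e^{-t} \le 1/t$ for $t > 0$:

```latex
\frac{2n}{\epsilon m} = \frac{2n\,\epsilon^2}{4n\,\epsilon} = \frac{\epsilon}{2},
\qquad
\exp\!\left(-\frac{\epsilon m}{8}\right) = \exp\!\left(-\frac{n}{2\epsilon}\right) \le \frac{2\epsilon}{n},
\qquad\text{so}\qquad
\Pr(s = s', r \notin B) \le \frac{\epsilon}{2} + \frac{2\epsilon}{n} .
```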
4 Gaps for natural learning problems



In this section we collect examples of natural learning problems in which we conjecture there is an inverse dependence 
of the training time on the sample size. Some of these examples already appeared explicitly in previous literature, but 
most are new, unpublished, or did not appear in such an explicit form. We base our inverse dependence conjecture 
on the current best known upper bounds. Of course, an immediate open question is to show matching lower bounds. 
However, our main goal here is to demonstrate general techniques of how to reduce the training runtime by requiring 
more examples. 

4.1 Agnostically Learning Preferences 

Consider the set $[d] = \{1, \ldots, d\}$, and let $\mathcal{X} = [d] \times [d]$ and $\mathcal{Y} = \{0,1\}$. That is,
each example is a pair $(i, j)$ and the label indicates whether $i$ is preferred over $j$.

Consider the hypothesis class of all permutations over $[d]$, which can be written as
$\mathcal{H} = \{ h_w(i, j) = \mathbf{1}[w_i > w_j] : w \in \mathbb{R}^d \}$. The loss function is the 0-1 loss. Note that
each hypothesis in $\mathcal{H}$ can be written as a Halfspace: $h_w(i, j) = \mathrm{sign}(\langle w, e_i - e_j \rangle)$.
Therefore, in the realizable case (namely, there exists $h \in \mathcal{H}$ which perfectly predicts the labels of
all the examples in the training set), solving the ERM problem can be performed in polynomial time. However, in
the agnostic case, finding a Halfspace that minimizes the number of mistakes is in general NP-hard. The sample
complexity of agnostically learning a $d$-dimensional Halfspace is $O(d/\epsilon^2)$ and we therefore obtain that with a
non-efficient algorithm, it is possible to learn using $O(d/\epsilon^2)$ examples.

On the other hand, in the following we show that with $m = \Omega(d^2/\epsilon^2)$ it is possible to learn preferences in
time $O(m)$. The idea is to define the hypothesis class of all Boolean functions over $\mathcal{X}$, namely,
$\mathcal{H}_1 = \{ h(i, j) = M_{i,j} : M \in \{0,1\}^{d^2} \}$. Clearly, $\mathcal{H} \subset \mathcal{H}_1$. In addition,
$|\mathcal{H}_1| = 2^{d^2}$ and therefore the sample complexity of learning $\mathcal{H}_1$ using the ERM rule is
$O(d^2/\epsilon^2)$. Last, it is easy to verify that solving the ERM problem can be done in time $O(m)$.
So, overall, we obtain the following:





| | Samples | Time |
|---|---|---|
| ERM over $\mathcal{H}$ | $d/\epsilon^2$ | not $\mathrm{poly}(d)$ |
| ERM over $\mathcal{H}_1$ | $d^2/\epsilon^2$ | $d^2/\epsilon^2$ |
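The linear-time ERM over the class of all Boolean functions on pairs amounts to a per-pair majority vote, which can be sketched in Python as below. The function name and the `default` label for pairs never seen in training are assumptions made for illustration; any fixed default yields the same training error.

```python
from collections import defaultdict


def erm_all_boolean(sample, default=0):
    """ERM over all Boolean functions on pairs: majority label for each observed pair."""
    votes = defaultdict(int)
    for (i, j), y in sample:          # a single pass over the m examples
        votes[(i, j)] += 1 if y == 1 else -1
    table = {pair: int(v > 0) for pair, v in votes.items()}
    return lambda i, j: table.get((i, j), default)
```

Because each pair is treated independently, the learner ignores the permutation structure of the original class, which is exactly why it needs on the order of $d^2/\epsilon^2$ rather than $d/\epsilon^2$ examples.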



4.2 Agnostic Learning of Kernel-based Halfspaces 

We now consider the popular class of kernel-based linear predictors. In kernel predictors, the instances $x$ are mapped
to a high-dimensional feature space $\psi(x)$, and a linear predictor is learned in that space. Rather than working with
$\psi(x)$ explicitly, one performs the learning implicitly using a kernel function $k(x, x')$ which efficiently computes
inner products $\langle \psi(x), \psi(x') \rangle$.

Since the dimensionality of the feature space may be high or even infinite, the sample complexity of learning
Halfspaces in the feature space can be too large. One way to circumvent this problem is to define a slightly different
concept class by replacing the non-continuous sign function with a Lipschitz continuous function,
$\phi : \mathbb{R} \to [0, 1]$, which is often called a transfer function. For example, we can use a sigmoidal transfer
function $\phi_{\mathrm{sig}}(a) = 1/(1 + \exp(-4La))$, which is an $L$-Lipschitz function. The resulting hypothesis class
is $\mathcal{H}_{\mathrm{sig}} = \{ x \mapsto \phi_{\mathrm{sig}}(\langle w, \psi(x) \rangle) : \|w\|_2 \le 1 \}$, where we
interpret the prediction $\phi_{\mathrm{sig}}(\langle w, \psi(x) \rangle) \in [0, 1]$ as the probability to predict a
positive label. The expected 0-1 loss then amounts to
$\ell(w, (x, y)) = |y - \phi_{\mathrm{sig}}(\langle w, \psi(x) \rangle)|$.
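The Lipschitz constant of the sigmoidal transfer function can be checked directly by differentiating, assuming the form $\phi_{\mathrm{sig}}(a) = 1/(1 + \exp(-4La))$:

```latex
\phi_{\mathrm{sig}}'(a)
  = \frac{4L\, e^{-4La}}{\bigl(1 + e^{-4La}\bigr)^{2}}
  = 4L\, \phi_{\mathrm{sig}}(a)\bigl(1 - \phi_{\mathrm{sig}}(a)\bigr)
  \le 4L \cdot \frac{1}{4} = L ,
```

with equality at $a = 0$, where $\phi_{\mathrm{sig}}(0) = 1/2$. The factor 4 in the exponent is thus what normalizes the steepest slope to exactly $L$.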

Using standard Rademacher complexity analysis (e.g. [3]), it is easy to see that the information theoretic sample
complexity of learning $\mathcal{H}$ is $O(L^2/\epsilon^2)$. However, from the computational complexity point of view, the
ERM problem amounts to solving a non-convex optimization problem (with respect to $w$). Adapting a technique due to [4] it
is possible to show that an $\epsilon$-accurate solution to the ERM problem can be calculated in time
$\exp\left(O\left(\frac{L^2}{\epsilon^2} \log\left(\frac{L}{\epsilon}\right)\right)\right)$.
The idea is to observe that the solution can be identified if someone reveals to us a subset of $(L/\epsilon)^2$ non-noisy
examples. Therefore we can perform an exhaustive search over all subsets of size $(L/\epsilon)^2$ of the $m$ examples in
the training set and identify the best solution.

In [12], a different algorithm has been proposed, that learns the class $\mathcal{H}_{\mathrm{sig}}$ using time and sample
complexity of at most $\exp(O(L \log(L/\epsilon)))$. That is, the runtime of this algorithm is exponentially smaller than
the runtime required to solve the ERM problem using the technique described in [4], but the sample complexity is also
exponentially larger. The main idea of the algorithm given in [12] is to define a new hypothesis class,
$\mathcal{H}_1 = \{ x \mapsto \langle w, \hat{\psi}(\psi(x)) \rangle : \|w\|_2 \le B \}$, where
$B = O((L/\epsilon)^L)$ and $\hat{\psi}$ is a mapping function for which



\[ \langle \hat{\psi}(\psi(x)), \hat{\psi}(\psi(x')) \rangle = \frac{2}{2 - \langle \psi(x), \psi(x') \rangle} = \frac{2}{2 - k(x, x')} . \]



While it is not true that $\mathcal{H} \subset \mathcal{H}_1$, it is possible to show that $\mathcal{H}_1$ "almost"
contains $\mathcal{H}$ in the sense that for each $h \in \mathcal{H}$ there exists $h_1 \in \mathcal{H}_1$ such that for
all $x$, $|h(x) - h_1(x)| \le \epsilon$. The advantage of $\mathcal{H}_1$ over $\mathcal{H}$ is that the functions in
$\mathcal{H}_1$ are linear and hence the ERM problem with respect to $\mathcal{H}_1$ boils down to a convex optimization
problem and thus can be solved in time $\mathrm{poly}(m)$, where $m$ is the size of the training set. In summary, we
obtain the following:





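| | Samples | Time |
|---|---|---|
| ERM over $\mathcal{H}$ | $L^2/\epsilon^2$ | $\mathrm{poly}\left(\exp\left(\frac{L^2}{\epsilon^2} \log\left(\frac{L}{\epsilon}\right)\right)\right)$ |
| ERM over $\mathcal{H}_1$ | $\mathrm{poly}\left(\exp\left(L \log\left(\frac{L}{\epsilon}\right)\right)\right)$ | $\mathrm{poly}\left(\exp\left(L \log\left(\frac{L}{\epsilon}\right)\right)\right)$ |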



4.3 Additional Examples 

In Appendix B we list additional examples of inverse dependence of runtime on sample size. These examples deal
with other learning settings like online learning and unsupervised learning. These examples are interesting since they
show other techniques to obtain faster algorithms using a larger sample. For example, we demonstrate how to use
exploration for injecting structure into the problem, which leads to a better runtime. The price of the exploration is
the need for a larger sample. For the unsupervised setting, we recall an existing example which shows a polynomial gap
for learning the support of a certain sparse vector.



5 Discussion 

In this paper, we formalized and discussed the phenomenon of an inverse dependence between the running time and
the sample size. While this phenomenon has also been discussed in some earlier works, it was under a restrictive
realizability assumption, that a perfect hypothesis exists, and the techniques mostly involved finding this hypothesis.
In contrast, we frame our discussion in the more modern approach of agnostic and improper learning.

In the first half of our paper, we provided a novel construction which shows such a tradeoff, based on a cryptographic
assumption. While the construction indeed has an inverse dependence phenomenon, it is not based on a natural
learning problem. In the second half of the paper, we provided more natural learning problems, which seem to have 
this phenomenon. Some of these problems were based on the intuition described in the introduction, but some were 
based on other techniques. However, the apparent inverse dependence in these problems is based on the assumption 
that the currently available upper bounds have matching lower bounds, which is not known to be true. Thus, we cannot 
formally prove that they indeed become computationally easier with the sample size. 

A major open question, therefore, is finding natural learning problems whose required running time has a provable
inverse dependence on the sample size. We believe the examples we outlined hint at the existence of such problems,
and provide clues as to the necessary techniques. Other problems are finding additional examples where this inverse 
dependence seems to hold, as well as finding additional techniques for making this inverse dependence happen. The 
ability to leverage large amounts of data to obtain more efficient algorithms would surely be a great asset to any 
machine learning application. 



References 

[1] A. Amini and M. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. Annals 
of Statistics, 37(5B):2877-2921, 2009. 

[2] B. Applebaum, B. Barak, and D. Xiao. On basing lower-bounds for learning on worst-case assumptions. In FOCS, 2008. 






[3] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine 
Learning Research, 3:463-482, 2002. 

[4] S. Ben-David and H. Simon. Efficient learning of linear perceptrons. In NIPS, 2000. 

[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, 2008. 

[6] S. Decatur, O. Goldreich, and D. Ron. Computational sample complexity. SIAM Journal on Computing, 29, 1998. 

[7] O. Goldreich. Foundations of Cryptography. Cambridge University Press, 2001. 

[8] S. Kakade, S. Shalev-Shwartz, and A. Tewari. Efficient bandit algorithms for online multiclass prediction. In International 
Conference on Machine Learning, 2008. 

[9] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17:115-141, 1994. 

[10] L. Pitt and L. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965-984, October 
1988. 

[11] R. Servedio. Computational sample complexity and attribute-efficient learning. J. of Comput. Syst. Sci., 60(1):161-178, 2000. 

[12] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the zero-one loss. In COLT, 2010. 

[13] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In ICML, 2008. 






A Technical Results 



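A.1 Proof of Lemma 1 

Using the definition of $\mathcal{D}$ and $h_x$, we have 

\begin{align*}
1 - \mathrm{err}(h_x) &= \Pr_{((r,s),b) \sim \mathcal{D}}\bigl(b = h_x(r,s)\bigr) \\
&= \Pr(s = P(x)) \, \Pr\bigl(b = h_x(r,s) \mid s = P(x)\bigr) + \Pr(s \neq P(x)) \, \Pr\bigl(b = h_x(r,s) \mid s \neq P(x)\bigr) \\
&= \Pr(s = P(x)) \cdot 1 + \Pr(s \neq P(x)) \cdot \tfrac{1}{2} = \tfrac{1}{2}\bigl(\Pr(s = P(x)) + 1\bigr).
\end{align*}

Rearranging, we get the result. 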
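
For reference, the rearrangement can be written out explicitly. The statement of Lemma 1 lies outside this section, so the form below is only what the last equality above implies, not a quotation of the lemma:

```latex
\[
\mathrm{err}(h_x)
  \;=\; 1 - \tfrac{1}{2}\bigl(\Pr(s = P(x)) + 1\bigr)
  \;=\; \tfrac{1}{2}\bigl(1 - \Pr(s = P(x))\bigr)
  \;=\; \tfrac{1}{2}\Pr(s \neq P(x)).
\]
```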

A.2 Proof of Lemma 2 

Let $p_k$ denote the probability that after drawing $r_1, \ldots, r_k$ i.i.d., an independently drawn $r_{k+1}$ is not 
spanned by $r_1, \ldots, r_k$. Also, let $B_k$ be a Bernoulli random variable with parameter $p_k$. Whenever $B_k = 1$, 
the dimensionality of the subspace spanned by the vectors we drew so far increases by 1. Since we are in an 
$n$-dimensional space, we must have $B_1 + \ldots + B_{m'} \le n$ with probability 1. In particular, we have 

\[ n \ge \mathbb{E}[B_1 + \ldots + B_{m'}] = p_1 + \ldots + p_{m'}. \]

Also, for any $k < m'$, by the assumption that the vectors are drawn i.i.d., we have 

\begin{align*}
p_{m'} = \Pr\bigl(r_{m'+1} \notin \mathrm{span}(r_1, \ldots, r_{m'})\bigr) &\le \Pr\bigl(r_{m'+1} \notin \mathrm{span}(r_1, \ldots, r_k)\bigr) \\
&= \Pr\bigl(r_{k+1} \notin \mathrm{span}(r_1, \ldots, r_k)\bigr) = p_k.
\end{align*}

Combining the two inequalities, it follows that $m' p_{m'} \le n$, so $p_{m'} \le n/m'$ as required. 
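
The bound of Lemma 2 can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original construction. It assumes that the vectors are bit vectors with linear span taken over GF(2), and it uses an arbitrary example distribution (independent biased bits). The names `reduce_gf2`, `draw_vector` and `estimate_p` are hypothetical.

```python
import random


def reduce_gf2(basis, v):
    """Reduce bitmask v against an XOR basis; the result is 0 iff v is in the span.

    `basis` maps a leading-bit position to a basis vector with that leading bit.
    """
    while v:
        top = v.bit_length() - 1
        if top not in basis:
            return v
        v ^= basis[top]  # clears the leading bit, so v strictly decreases
    return 0


def draw_vector(n, bias, rng):
    """Assumed example distribution: n independent bits, each equal to 1 w.p. `bias`."""
    return sum(1 << i for i in range(n) if rng.random() < bias)


def estimate_p(n, m, bias=0.1, trials=2000, seed=0):
    """Monte Carlo estimate of p_m = Pr(r_{m+1} not in span(r_1, ..., r_m))."""
    rng = random.Random(seed)
    escapes = 0
    for _ in range(trials):
        basis = {}
        for _ in range(m):
            rest = reduce_gf2(basis, draw_vector(n, bias, rng))
            if rest:  # the new vector increased the dimension of the span
                basis[rest.bit_length() - 1] = rest
        # The dimension can increase at most n times, as used in the proof.
        assert len(basis) <= n
        escapes += (reduce_gf2(basis, draw_vector(n, bias, rng)) != 0)
    return escapes / trials


if __name__ == "__main__":
    n = 8
    for m in (8, 16, 32, 64):
        print(m, estimate_p(n, m), "bound:", min(1.0, n / m))
```

Up to sampling noise, the estimates should stay below the printed bound $n/m$ for any choice of the example distribution.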

B Additional Examples 

B.1 Online Multiclass Categorization with Bandit Feedback 

This example is based on [8]. It deals with another variant of the multi-armed bandit problem. It shows how to use 
exploration for injecting structure into the problem, which leads to a decrease in the required runtime. The price of 
the exploration is a larger regret, which corresponds to the need of a larger number of online rounds for achieving the 
same target error. 

The setting is as follows. At each online round, the learner first receives a vector $x_t \in \mathbb{R}^d$ and needs 
to predict one of $k$ labels (corresponding to arms). Then, the environment picks the correct label $y_t$ without 
revealing it to the learner, and only tells the learner whether its prediction was correct. 

We analyze the number of mistakes the learner makes in $T$ rounds, where we assume that there exists some 
matrix $W^* \in \mathbb{R}^{k,d}$ such that at each round the correct label is $y_t = \arg\max_{y \in [k]} (W^* x_t)_y$. 
We further assume that the score of the correct label is higher than that of the runner-up by at least $\gamma$ and 
that the Frobenius norm of $W^*$ is at most 1. We also assume that $d$ is of order $1/\gamma^2$ (this is not 
restrictive, since random projections can be applied). 

We now consider two algorithms. The first uses a multiclass version of the Halving algorithm (see [8]) which can 
be implemented with the bandit feedback and has a regret bound of $O(k^2 d)$. However, the runtime of this algorithm 
is $2^{kd}$. 

The second algorithm is the Banditron of [8]. The Banditron uses exploration to reduce the learning problem 
to that of learning a multiclass classifier in the full information case, which can be performed efficiently using 
the Perceptron algorithm. In particular, in some of the rounds the Banditron guesses a random label, attempting to 
"fish" for the relevant information. This exploration yields a higher regret bound of $O(\sqrt{kdT})$. 

We can therefore draw the following table, which shows a tradeoff between running time and number of rounds 
required to obtain regret $\le \epsilon$. Note that we will usually want $\epsilon$ to be much smaller than $1/k$. 

                    Rounds               Time 
Inefficient alg.    $k^2 d/\epsilon$     $2^{kd}$ 
Efficient alg.      $kd/\epsilon^2$      $Tkd$ 
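
The exploration step of the Banditron described in this section can be sketched as follows. This is a minimal illustration written from the description here and from the usual form of the update in [8], not the authors' implementation. The function name `banditron`, the parameter `explore_rate` and the `(x, y)` stream format are assumptions made for the example. The true label `y` is only used to compute the one-bit feedback.

```python
import random


def banditron(stream, k, d, explore_rate, seed=0):
    """Run a Banditron-style learner on (x, y) pairs; return (W, mistakes)."""
    rng = random.Random(seed)
    W = [[0.0] * d for _ in range(k)]
    mistakes = 0
    for x, y in stream:
        # Greedy prediction of the current linear multiclass predictor.
        scores = [sum(w * xj for w, xj in zip(row, x)) for row in W]
        y_hat = max(range(k), key=lambda r: scores[r])
        # Exploration: mostly play y_hat, sometimes a uniformly random label.
        probs = [explore_rate / k + (1.0 - explore_rate) * (r == y_hat)
                 for r in range(k)]
        y_tilde = rng.choices(range(k), weights=probs)[0]
        correct = (y_tilde == y)  # the only feedback the learner receives
        mistakes += (not correct)
        # Importance-weighted update whose expectation is the
        # full-information multiclass Perceptron update.
        for j in range(d):
            W[y_hat][j] -= x[j]
        if correct:
            for j in range(d):
                W[y_tilde][j] += x[j] / probs[y_tilde]
    return W, mistakes
```

Each round costs $O(kd)$ arithmetic operations, in contrast to the exponential runtime quoted for the Halving-based algorithm.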



B.2 Sparse Principal Component Recovery 

This example is taken from [1]. This time, it is in the context of unsupervised statistical learning. 

The problem is as follows: we have an i.i.d. sample of vectors drawn from $\mathbb{R}^d$. The distribution is assumed 
to be Gaussian $\mathcal{N}(0, \Sigma)$ with a "spiked" covariance structure. Specifically, the covariance matrix 
$\Sigma$ is assumed to be of the form $I_d + zz^\top$, where $z$ is an unknown sparse vector with only $k$ non-zero 
elements, each of the form $\pm 1/\sqrt{k}$. Our goal in this setting is to detect the support of $z$. 

[1] provide two algorithms to deal with this problem. The first method is a simple diagonal thresholding scheme, 
which takes the empirical covariance matrix $\hat{\Sigma}$ and returns the $k$ indices for which the diagonal entries 
of $\hat{\Sigma}$ are largest. It is proven that if $m > ck^2\log(d-k)$ (for some constant $c$), then the probability 
of not perfectly identifying the support of $z$ is at most $\exp(-O(k^2\log(d-k)))$, which goes to 0 as $k$ and $d$ 
grow. Thus, we can view the sample complexity of this algorithm as $O(k^2\log(d-k))$. In terms of running time, given 
a sample of size $m = O(k^2\log(d-k))$, the method requires computing the diagonal of $\hat{\Sigma}$ and sorting it, 
for a total runtime of $O(k^2 d\log(d-k) + d\log(d)) = O(k^2 d\log(d))$. 
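
The diagonal thresholding scheme is short enough to sketch directly. The code below is a minimal illustration under the spiked model described above, not the implementation of [1]. The helper names `sample_spiked` and `diagonal_thresholding` and the chosen dimensions are assumptions made for the example. Samples are generated as $g + a z$ with $g \sim \mathcal{N}(0, I_d)$ and $a \sim \mathcal{N}(0, 1)$ independent, which has covariance $I_d + zz^\top$.

```python
import math
import random


def sample_spiked(m, z, rng):
    """Draw m samples from N(0, I_d + z z^T) as g + a*z, g ~ N(0, I_d), a ~ N(0, 1)."""
    samples = []
    for _ in range(m):
        a = rng.gauss(0.0, 1.0)
        samples.append([rng.gauss(0.0, 1.0) + a * zj for zj in z])
    return samples


def diagonal_thresholding(samples, k):
    """Return the k indices with the largest diagonal entries of the empirical covariance."""
    m, d = len(samples), len(samples[0])
    # The distribution is zero-mean, so diagonal entry j is the mean of x_j^2: O(m*d) time.
    diag = [sum(x[j] * x[j] for x in samples) / m for j in range(d)]
    # Sorting the d diagonal entries costs O(d log d).
    top = sorted(range(d), key=lambda j: diag[j], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    rng = random.Random(0)
    d, k = 20, 4
    support = [2, 5, 11, 17]  # hypothetical support, chosen for the demo
    z = [0.0] * d
    for j in support:
        z[j] = rng.choice([-1.0, 1.0]) / math.sqrt(k)
    print(diagonal_thresholding(sample_spiked(5000, z, rng), k))
```

Only the diagonal of the empirical covariance is ever computed, which is what keeps the runtime at $O(md + d\log d)$ rather than the $O(md^2)$ needed to form the full matrix.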

The second algorithm is a more sophisticated semidefinite programming (SDP) scheme, which can be solved 
exactly in time $O(d^4\log(d))$. Moreover, the sample complexity for perfect recovery is shown to be asymptotically 
$O(k\log(d-k))$. Summarizing, we have the following clear sample-time complexity tradeoff. Note that here, the gaps 
are only polynomial. 





                Samples              Time 
SDP             $k\log(d-k)$         $d^4\log(d)$ 
Thresholding    $k^2\log(d-k)$       $k^2 d\log(d)$ 